ABSTRACT

External-memory dictionaries are a fundamental data structure in
file systems and databases. Versioned (or fully-persistent)
dictionaries have an associated version tree where queries can be
performed at any version, updates can be performed on leaf versions,
and any version can be 'cloned' by adding a child. Various
query/update tradeoffs are known for unversioned dictionaries, many
of them with matching upper and lower bounds. No fully-versioned
external-memory dictionaries are known with optimal
space/query/update tradeoffs. In particular, no versioned
constructions are known that offer updates in o(1) I/Os using O(N)
space. We present the first cache-oblivious and cache-aware
constructions that achieve a wide range of optimal points on this
tradeoff.

General Terms 

Cache-oblivious algorithms, External-memory algorithms, Ver- 
sioned data structures 

1. INTRODUCTION 

We study tradeoffs between space, query cost and update cost for
versioned external-memory dictionaries. A versioned dictionary
stores keys and their values with an associated version tree, and
supports the following operations:

• update(key, value, version): associate value with key in the
specified leaf version

• query(start, end, version): return every key in the range
[start, end] together with the value written in the closest
ancestor to version

• clone(version): create a new version as a child of the specified
version

A versioned dictionary can be thought of as efficiently
implementing the union of many dictionaries: the 'live' keys at
version v are the union of all the keys in ancestor versions, where
if a key appears more than once, its closest ancestor takes
precedence. If the structure supports arbitrary version trees, then
we call it (fully-)versioned; if it supports only linear version
trees, we call it partially-versioned. We are interested in
fully-versioned structures. We use N to denote the total number of
keys written; for a version v, we use N_v to denote the number of
keys that are live at v.
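The semantics of the three operations can be pinned down with a deliberately naive in-memory model. This is a sketch for reference only, not the construction of this paper: the class name NaiveVersionedDict, the integer version ids and the per-version dict of writes are all assumptions made for illustration, and a query walks the whole ancestor chain.

```python
class NaiveVersionedDict:
    """Reference semantics only: one in-memory dict of writes per version."""

    def __init__(self):
        self.parent = {0: None}   # version tree; version 0 is the root
        self.children = {0: []}
        self.writes = {0: {}}     # version -> {key: value} written there

    def clone(self, version):
        new = len(self.parent)    # hypothetical id scheme: next integer
        self.parent[new] = version
        self.children[new] = []
        self.children[version].append(new)
        self.writes[new] = {}
        return new

    def update(self, key, value, version):
        if self.children[version]:
            raise ValueError("updates are only allowed at leaf versions")
        self.writes[version][key] = value

    def query(self, start, end, version):
        result = {}
        v = version
        while v is not None:      # walk from version up to the root
            for k, val in self.writes[v].items():
                if start <= k <= end and k not in result:
                    result[k] = val   # closest ancestor takes precedence
            v = self.parent[v]
        return sorted(result.items())


d = NaiveVersionedDict()
d.update("a", 1, 0)
d.update("b", 2, 0)
v1 = d.clone(0)
v2 = d.clone(0)
d.update("a", 10, v1)
d.update("c", 3, v2)
print(d.query("a", "z", v1))   # [('a', 10), ('b', 2)]
```

Version v1 sees its own write to "a" and inherits "b" from the root, while v2 still sees the root's value for "a".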

The B-tree [2] is the classic external-memory dictionary. More
recently, data structures that achieve a wide range of query/update
tradeoffs have been discovered. In particular, those that offer
updates in o(1) I/Os while increasing query cost only slightly are
of great practical interest.

We aim to answer the following open questions: 

1. Can one achieve optimal O(N) space with the same
query/update bounds as a CoW B-tree?

2. Can one achieve other points on the tradeoff curve? 

3. Can these be achieved in both the DAM and CO mod- 
els? 

Even ignoring updates, it is already difficult to efficiently an- 
swer range queries with little space. For deep version trees, 
many keys in the range may not have been updated since 
the root version, while some may have been updated many 
times since then. It is easy to see that some elements must 
be replicated many times for range queries to be asymp- 
totically optimal - a construction that achieves this while 
balancing asymptotically optimal space, query and update 
costs is our main contribution. 

As a warm-up, consider the following two naive implementations.
Keeping a B-tree of the latest key for each version gives excellent
query performance, but at the expense of space and update costs. In
contrast, keeping a single B-tree with elements ordered by (key,
version) uses optimal O(N) space, but a small range query may be
forced to scan all the elements in O(N/B) I/Os.

1.1 Unversioned query/update tradeoffs 

The B-tree [2] is the classic external-memory dictionary. The B-tree
is typically analyzed in the disk access machine (DAM) model [14]:
this assumes an internal memory of size M and an arbitrarily large
external memory where each I/O can read or write a block of B
elements. An N-node B-tree supports updates in O(log_B N) such I/Os
and range queries returning Z elements in O(log_B N + Z/B) I/Os. An
important characteristic of the B-tree is that it is optimal for
searching within the DAM model.

It has been observed that there is a tradeoff between query and
update performance, and that B-trees achieve only one point on this
tradeoff. The buffered-repository tree (BRT) [9] supports updates in
amortized O((log N)/B) I/Os and queries in O(log N) I/Os. Hence,
searches are slower in the BRT than in the B-tree, whereas updates
are significantly faster. More generally, the B^ε-tree of Brodal and
Fagerberg [8] supports a large part of this tradeoff: for 0 ≤ ε ≤ 1,
the B^ε-tree supports updates in amortized
O((log_{B^ε+1} N)/B^{1−ε}) I/Os and searches in O(log_{B^ε+1} N)
I/Os. Thus, when ε = 1 it matches the performance of a B-tree, and
when ε = 0 it matches the performance of a BRT. An interesting
intermediate point is ε = 1/2: searches are then slower by a factor
of roughly 2, but updates are roughly √B/2 times faster than in a
B-tree.
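The ε = 1/2 point can be checked numerically. The sketch below assumes the bounds take the form O((log_{B^ε+1} N)/B^{1−ε}) for updates and O(log_{B^ε+1} N) for searches, drops all constant factors, and uses illustrative values of N and B; the function name b_eps_costs is hypothetical.

```python
import math


def b_eps_costs(n, b, eps):
    """Return (update, search) I/O estimates with constants dropped."""
    base = b ** eps + 1           # fanout B^eps + 1
    height = math.log(n, base)    # log_{B^eps + 1} N
    return height / b ** (1 - eps), height


n, b = 2 ** 40, 1024
update_btree, search_btree = b_eps_costs(n, b, 1.0)
update_half, search_half = b_eps_costs(n, b, 0.5)

# Searches are roughly 2x slower, updates roughly sqrt(B)/2 = 16x faster.
print(search_half / search_btree)
print(update_btree / update_half)
```

With B = 1024 the search ratio comes out just under 2 and the update speedup just over 16, matching the "factor of roughly 2" and "roughly √B/2" claims.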

Similar results are known for the cache-oblivious (CO) model [12].
The CO model is similar to the DAM model, except that the block size
B is unknown to the algorithm and cannot be used as a tuning
parameter. The COLA of Bender et al. [5] achieves the same tradeoffs
as the BRT in the CO model. More recently, Brodal et al. [7]
presented a CO algorithm that achieves the same range of tradeoffs
as the B^ε-tree. It is worth noting that all these schemes achieve
the optimal O(N) space bound: it has not been necessary to use more
space in order to achieve the tradeoffs in either model.



1.2 Versioned query/update tradeoffs 

No similar tradeoffs are known for versioned dictionaries, either in
the DAM or CO models. In fact, matching these bounds in the CO model
with fast updates is impossible: Afshani et al. [1] showed that any
partially-versioned CO dictionary supporting range queries in
O((log_B N)^c (1 + Z/B)) I/Os for any c > 0 must use
Ω(N (log log N)^ε) space, where ε > 0 depends on c. In their model,
every update creates a new version (hence there are N versions). By
contrast, in our model we explicitly track a version tree V, and new
versions are created with a clone operation. We also assume that the
version tree can fit entirely in memory; thus, it is perhaps more
appropriate to describe our solution as 'semi-external memory'.

The classic versioned analogue of the B-tree is the copy-on-write
(CoW) B-tree [10], which is based on the 'path-copying' technique
originally presented by Driscoll et al. [11] for making
internal-memory data structures fully persistent; that technique,
however, does not apply efficiently to external-memory structures.
The CoW B-tree supports updates to version v in O(log_B N_v) I/Os
and range queries of size Z in O(log_B N_v + Z/B) I/Os. Clearly,
these query bounds are the best we can hope for, since O(log_B N_v)
is the bound we would get if we were to isolate all the keys
accessible from version v and store them in a B-tree. This data
structure is fundamental to every NetApp filer [13], the ZFS file
system [6], and numerous other file systems and databases. The basic
idea is to use a B-tree with many roots, one for each version. A
lookup proceeds as in a B-tree, starting from the appropriate root.
An update to key k at version v goes as follows: if there is a root
node for v, perform a regular B-tree update for k starting at that
root node; otherwise, find the root node for v's parent version and
perform a regular B-tree lookup for k to find the correct leaf node,
then duplicate this entire root-to-leaf path, and finally set the
root node of this path as the root node for version v.
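Path copying is easiest to see on a small in-memory tree. The sketch below is a simplification under stated assumptions: it uses a binary search tree rather than a B-tree, copies the search path on every update (rather than only on the first update to a new version), and the names Node, insert, lookup and roots are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return a new root; only nodes on the search path are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value,
                    insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value,
                    root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)


def lookup(root, key):
    """Ordinary search, starting from the root of the chosen version."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


roots = {}                       # one root per version
roots["v0"] = insert(insert(insert(None, 5, "a"), 2, "b"), 8, "c")
roots["v1"] = insert(roots["v0"], 8, "c'")   # clone v0, then update key 8
```

Both versions remain queryable from their own roots, and the untouched left subtree is shared between them; the copied root-to-leaf path is the source of the space overhead discussed next.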

The CoW B-tree has two major problems that we seek to address:
first, it is not space-optimal, since in general each update may
cause a new path to be written, giving Θ(N B log_B N) space; and
second, it does not offer any update/query tradeoffs. The
'multiversion B-tree' (MVBT) of Becker et al. [3] achieves
O(log_B N_v) I/Os for updates and queries with O(N) space, but is
only partially-versioned and does not support any other tradeoffs.

1.3 Our results 

We present the first fully-versioned dictionaries that achieve
optimal O(N) space and optimal query/update tradeoffs in both the
DAM and CO models. One can see them as fully-versioned analogues of
the B^ε-tree [8] and the COLA [5] in the DAM and CO models
respectively.

In the DAM model, we present an external-memory versioned
dictionary using space O(N) that supports updates to
version v in amortized O((log_B N_v)/(ε B^{1−ε})) I/Os and supports
range queries of size Z in worst-case O((log_B N_v)/ε + Z/B) I/Os.

In the CO model, we present a cache-oblivious external-memory
versioned dictionary that uses space O(N) and supports updates to
version v in amortized O((log N_v)/B) I/Os, and range queries at
version v returning Z elements in amortized O(log N + Z/B) I/Os. We
can deamortize the structure so that updates run in worst-case
O(log N_v) I/Os (with the same amortized bound), and point queries
at version v run
in worst-case O(log N_v) I/Os. Similarly to Bender et al. [5], with
knowledge of B ('cache-aware'), the data structure can support
updates to version v in amortized O((log_B N_v)/(ε B^{1−ε})) I/Os,
and range queries of size Z in amortized O((log_B N_v)/ε + Z/B)
I/Os.

Our results leave open two problems: first, fully deamortizing
range queries in the CO dictionary, and second, achieving the
ε-dependent bounds without knowledge of B.

2. PRELIMINARIES 

2.1 Key and Version Ordering 

We often discuss ordering elements lexicographically by key and
version ('kv order'). Keys are assumed to have a natural total
ordering, so we shall describe the version ordering. The versions
are nodes in a version tree, so we have the ancestor partial order
≼: we write x ≼ y to mean 'x is an ancestor of y'. We say that
versions x, y are comparable if either x ≼ y or x ≽ y. For kv order,
we allow any total order consistent with ≽ in the sense that every
version v occurs after all descendants w ≻ v. Ordering versions
descending by their DFS number satisfies this, with the advantage
that ancestorship can be tested in O(1) time: let the interval
I(v) = [DFS(v), max_{u ≽ v} DFS(u)]; then v ≼ w ⟺ DFS(w) ∈ I(v). As
the version tree changes, we can use an efficient renumbering scheme
to retain integer DFS values, such as in the order maintenance
problem [4].

¹Typically, B is in the thousands in practice and log_B N < 5.
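The interval test can be sketched as follows, assuming a static version tree given as a child-list dictionary (the renumbering needed when the tree changes is not shown). The names dfs_intervals, is_ancestor and the example tree are illustrative only.

```python
def dfs_intervals(children, root):
    """Map each version to (DFS number, largest DFS number in its subtree)."""
    interval = {}
    counter = 0

    def visit(v):
        nonlocal counter
        start = counter
        counter += 1
        for c in children.get(v, []):
            visit(c)
        interval[v] = (start, counter - 1)

    visit(root)
    return interval


def is_ancestor(interval, v, w):
    """True if v is a (weak) ancestor of w: DFS(w) lies in I(v)."""
    lo, hi = interval[v]
    return lo <= interval[w][0] <= hi


children = {"r": ["a", "b"], "a": ["c"]}
iv = dfs_intervals(children, "r")

# Descending DFS number: every version comes after all of its descendants.
kv_version_order = sorted(iv, key=lambda v: -iv[v][0])
print(kv_version_order)   # ['b', 'c', 'a', 'r']
```

Each ancestorship test is two integer comparisons, which is what makes the forward scan in a lookup cheap.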

2.2 Definitions 

Consider a set of elements A and versions V. An element (k, v) is a
lead element (at v) if v ∈ V. Define lead(A, v) as the total number
of lead elements at v in A, and lead(A, V) = Σ_{v ∈ V} lead(A, v).
The lead-below count is the total lead at versions descendant from
v, i.e. lead_below(v) = Σ_{w ≽ v} lead(w).

An element (k, x) is said to be live (or accessible) at version v in
A if x ≼ v and k has not been rewritten between x and v, i.e. there
is no other element (k, y) ∈ A with x ≺ y ≼ v. Let live(A, v) be the
total number of elements of A that are live at v. Note that if v ≼ w
then live(v) ≤ live(w). Also

live(v) ≤ live(parent(v)) + lead(v),    (1)

with the difference between right and left-hand sides being equal to
the number of keys k which appear in both versions v and parent(v).
We use N to denote the total number of keys written; for a version
v, we use N_v to denote the number of keys that are live at v, i.e.
the number of distinct keys written in ancestor versions of v (each
key is live at least once, so Σ_v N_v ≥ N).

We assume that keys and values (which could be pointers to data or
real data) are all of fixed size.
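The live and lead counts, and inequality (1), can be illustrated on a tiny example. The sketch assumes elements are (key, version) pairs and the version tree is a parent map; the helper names are hypothetical.

```python
def ancestors(parent, v):
    """Yield v, parent(v), ... up to the root."""
    while v is not None:
        yield v
        v = parent[v]


def live(elements, parent, v):
    """Elements live at v: for each key, the write at its closest ancestor."""
    written = {}
    for k, x in elements:
        written.setdefault(x, []).append(k)
    seen, out = set(), []
    for a in ancestors(parent, v):
        for k in written.get(a, []):
            if k not in seen:
                seen.add(k)
                out.append((k, a))
    return out


def lead(elements, v):
    """Number of elements written at version v itself."""
    return sum(1 for _, x in elements if x == v)


parent = {0: None, 1: 0, 2: 0}
elements = [("a", 0), ("b", 0), ("a", 1), ("c", 2)]

# Slack in (1) at version 1: key "a" is written at 1 and also live at 0.
slack = len(live(elements, parent, 0)) + lead(elements, 1) \
    - len(live(elements, parent, 1))
print(slack)   # 1
```

At version 2 the slack is zero, because "c" is a new key rather than an overwrite.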

3. A CACHE-OBLIVIOUS VERSIONED B-TREE

In this section we present a cache-oblivious versioned B-tree, which
we refer to as a stratified doubling array (SDA). It contains a
collection of arrays of key-version-value tuples, arranged into
levels, with 'forward pointers' to facilitate searching. Arrays in
level l are roughly twice as large as arrays in level l − 1, hence
'doubling', and have disjoint sets of versions associated to them,
hence 'stratified' in version space.

The basic idea is to store arrays of kv-ordered elements, as in the
COLA of Bender et al. [5], except that we apply a version split
process, similar to the one employed in the versioned B-tree, albeit
more complex, in order to avoid arrays containing too few elements
from some version (we call this a 'density' property). The result is
that each level may have several arrays, tagged with disjoint sets
of versions that indicate which should be used.

3.1 Arrays 

An array (A, V) contains a set A of entries (k, v, x), where k is a
key, v is a version, and x is either a data value or a forward
pointer containing an array index (the array into which it indexes
will become clear from the context later), ordered by (k, v). The
set V is a set of 'valid versions' that will be used for lookups and
merges between various arrays. Each array also contains a pointer to
a unique 'next array', identifying the array, if any, into which its
forward pointers point. Arrays implement the following operations:

• search(k, v, [lb], [ub]): search for a (k, v) pair, within
optional lower and upper bounds. It returns the index of a least
upper bound y for (k, v) in the kv order, and the destinations of
the two closest forward pointers either side of y.

• iterate(loc): provides an iterator over elements starting from
index loc.

• append(k, v, x): appends the entry to the end of the array,
returning its location.

3.2 Definitions 

For a version v, the density of version v in A is δ(A, v) =
live(A, v)/|A|. We say that a version v is dense in A if δ(A, v) ≥
1/3, and that an array (A, V) is dense if every v ∈ V is dense in A.
Note that if v is dense in (A, V) then every descendant version is
also dense there.

Given a non-empty set of versions V, we say a version v is an orphan
of V if it has no strict ancestor in V. We say the array (A, V) is a
stratum if the orphans of V are all siblings: they have the same
parent, not in V, which we write without ambiguity as parent(V).

For a version v and set of versions V, let T_V[v] = {w ∈ V : v ≼ w}
be the subtree of V rooted at v. For W ⊆ V a set of versions and A
an array, define the split of A with respect to W to be the set of
all entries live in any version in W: Λ(A, W) = {(k, x) ∈ A : (k, x)
is live at some v ∈ W}. For W a stratum with orphans W_i having
common parent p, define

arr_size(A, W) = live(A, p) + lead(A, W)
               = live(A, p) + Σ_i lead_below(A, W_i).    (2)

As in (1), |Λ(A, W)| ≤ arr_size(A, W), with the difference being
those keys live in the parent version but overwritten in all orphans
of W.

As a special case, when W = T_V[v] for some version v ∈ V, define
Λ_T(A, V, v) = Λ(A, T_V[v]), and as usual, where A and V are clear,
we write T[v] and Λ_T(v) for the set of versions and corresponding
split respectively. Note that lead(A, T_V[v]) = lead_below(A, v).

A version split of an array (A, V) gives a set of strata
{(A_i, V_i)}_i such that A = ∪_i A_i and V = ∪_i V_i, and the V_i
are mutually disjoint.
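Density, the split and arr_size can be computed directly from these definitions on a small array. The sketch assumes entries are (key, version) pairs with at most one entry per key and version, represents the version tree by a parent map, and sorts the split by key and version id purely for display; all function names are hypothetical.

```python
def live_at(A, parent, v):
    """Set of (key, version) entries of A that are live at version v."""
    out, seen = set(), set()
    while v is not None:
        for k, x in A:
            if x == v and k not in seen:
                out.add((k, x))
        seen |= {k for k, x in A if x == v}
        v = parent[v]
    return out


def density(A, parent, v):
    """Fraction of A that is live at v."""
    return len(live_at(A, parent, v)) / len(A)


def split(A, parent, W):
    """Entries of A live at some version in W."""
    keep = set()
    for v in W:
        keep |= live_at(A, parent, v)
    return sorted(keep)


def arr_size(A, parent, W, p):
    """live(A, p) + lead(A, W) for a stratum W with common parent p."""
    lead = sum(1 for _, x in A if x in W)
    return len(live_at(A, parent, p)) + lead


parent = {0: None, 1: 0, 2: 0, 3: 1}
A = [("a", 0), ("a", 1), ("b", 0), ("b", 2), ("c", 3)]
W = {1, 3}                      # stratum with single orphan 1, parent 0

print(split(A, parent, W))      # [('a', 1), ('b', 0), ('c', 3)]
print(arr_size(A, parent, W, 0))
```

Here arr_size exceeds the split size by one, accounted for by key "a", which is live in the parent version 0 but overwritten in the orphan.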

3.3 Levels 

As previously mentioned, an SDA keeps (k, v, x) tuples in arrays
arranged into levels. Each level l ≥ 0 contains a set of arrays
(A_i, V_i) with disjoint sets of valid versions. We keep in memory a
map from each version to the array in which it is valid, if such an
array exists. We also keep track of the subset of those versions
(which we call 'real') for which there is at least one lead key in
the array where v is valid.



3.3.1 Promotion Conditions 

Before describing invariant properties of levels, we introduce the
following logical conditions on arrays and versions:

• We try to ensure that arrays at level l have sizes in the range
2^l ≤ |A| < 2^{l+1}; we refer to these size conditions as
(P-min-size)_{l,A} : |A| ≥ 2^l, and its contrary
(P-max-size)_{l,A} = ¬(P-min-size)_{l+1,A};

• It will be important for all arrays, both those in a level and
those being promoted, to have a suitably large number of keys live
in each version; such a lower bound on the number of live keys
clearly implies a density constraint given the size constraints
above. Formally we'll refer to (P-live)_{l,v} : live(v) ≥ 2^l/3.

• Likewise it often turns out to be important that no strict
ancestor of an array has so many keys live. The intuition here is
that we want to ensure that there are not too many copied keys in an
array so that merges can be 'paid' for by insertion of lead keys.
The constraint we'll refer to is simply the negation of P-live at
the parent: (P-plive)_{l,v} = ¬(P-live)_{l+1,parent(v)};

• In order to be able to argue that the amount of merge work done is
bounded by a linear function of the number of keys inserted, we will
insist on a lower bound on the number of lead keys in each promoted
array: (P-lead)_{l,v} : lead_below(v) ≥ 2^{l+1}/3;

• Putting it together, when searching for arrays to promote we will
look for a version v whose subtree satisfies the conjunction of the
above properties: (P-prom)_{l,v} = (P-live)_{l+1,v} ∧
(P-lead)_{l+1,v} ∧ (P-min-size)_{l+1,Λ_T(v)}.

The following properties hold for every array (A, V) at level l with
at least one real version (i.e. version with lead > 0):

• (L-dense) A is dense for all versions in V.

• (L-size) A is not too big: (P-max-size)_{l,A};

• (L-live) A has a minimal number of keys live for every version:
(P-live)_{l,v} for all v ∈ V;

• (L-plive) The parent of A has few keys live: (P-plive)_{l,V};

• (L-no-prom) There are no versions in A which are 'promotable' in
the sense used above: ∀v ∈ V, ¬(P-prom)_{l+1,v};

• (L-edge) If there are enough keys in version v to justify
promotion (ignoring the lead requirement) then no descendant can
have reached a strictly higher level yet: ∀v ∈ V, (P-live)_{l+1,v}
⇒ no level l' > l contains a key in a version strictly descendant
from v.

We will show how to maintain these properties under up- 
dates. 



3.4 Updates and promotions 

Promotion is the process by which an array is moved up from one
level to the next and merged with an existing array there. The
update(k, v, x) operation is itself considered to be a promotion:
the promotion of the singleton array A = [(k, v, x)] with valid
version set V = {v} to level 0. In general, the only sort of array
(A, V) that will be promoted from one level to the next is of the
form Λ_T(v) for a suitable v. The properties we are interested in
for (A, V) are:

• V has a unique orphan v, which makes the choice of 
array with which to merge simple; 

• A has a large fraction of lead keys, which allows us to 
account for the cost of merging and splitting; 

• versions obey a density property, which in conjunction 
with density for existing arrays in the target level al- 
lows us to maintain (L-live). 

We will show that the result of this merge can be 'version split'
into new arrays which are either suitable to remain at level l,
satisfying the level requirements above, or are suitable to be
promoted to the next level l + 1. Often the result of the merge need
not be split at all, or can be promoted in its entirety. To be
precise, an array (A, V) promoted to level l will satisfy the
following 'promotion conditions':

• (P-orphan): V has a unique orphan v;

• (P-non-trivial)_v : lead(v) > 0;

• (P-max-size)_{l,A} (A is not too big);

• (P-plive)_{l,v} (A's parent has few live keys);

• (P-prom)_{l,v} (... but A itself is promotable: it has large
enough live and lead counts, and is big enough);

• (P-edge): no level l' > l contains a key in a version strictly
descendant from v.

Note that a single insert into level 0 satisfies these conditions.

3.5 Algorithm Overview 

The choice of which array at level l to merge (A, V) with orphan v
into is simple: if v is registered to some array A', then merge into
that; else, if the next array to which forward pointers in A point
is at level l, then merge into that; otherwise, there is no suitable
array: A' = ∅.

Our general approach is to first calculate an appropriate set of
output version sets, based on lead and live statistics for the input
arrays A and A'; to each such version set we will associate an
output array, initially empty; then we will iterate over the
contents of the input arrays in (k, v)-order, appending each entry
to the appropriate output arrays.

This process is I/O efficient, since it requires one complete
sequential read across each input array, and sequential writes to
each output array. Importantly, for practical implementations, with
sufficient prefetching and buffering, it can take advantage of
sequential I/O. After the output arrays are generated, forward
pointer sample arrays will be back-propagated down towards level 0,
and at most one will be promoted up to the level above. Thus the
merge operation can be decomposed into the following phases:

1. seek a promotable version set, and remove it if one 
is found; 

2. perform a version split of the remainder; 

3. execute the resulting promotion and version split; and 

4. back-propagate forward pointer arrays. 

We now describe each of these phases in detail. 

3.5.1 Finding a Promotable Version

We say a version w ∈ V'' is promotable if Λ_T(w) obeys the promotion
conditions for level l + 1, most importantly (P-prom)_{l+1,w}. Using
the statistics of the merged array (A'', V''), we search for the
promotable version w for which |Λ_T(w)| is the largest possible.
This can be done by searching recursively down through V'', starting
with whichever orphan z of V'' is ancestral to w. Note that once the
(P-lead) or (P-min-size) conditions fail, the whole search fails,
since both are non-increasing down the tree.



Algorithm 1 find_promotable(V'', w)
Require: A threshold size M

if |Λ_T(w)| < M or lead_below(w) < 2M/3 then
    return null
else if lead(w) > 0 and live(w) ≥ M/3 then
    return w
else
    for u in the children of w do
        let u' = find_promotable(V'', u)
        if u' is not null then
            return u'
        end if
    end for
    return null
end if



Pseudo-code is given in Algorithm 1. We search for a promotable
version w = find_promotable(V'', z) with M = 2^{l+1}. If w is null,
then we proceed to the version split phase using (A'', V'');
otherwise we remove the subtree rooted at w from V'' before
proceeding, using the suitably diminished counts for lead_below(·)
on the remaining versions. Both the elements extracted by
find_promotable and the remainder satisfy the desired properties.
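The search of Algorithm 1 translates almost line for line into code. The sketch below assumes the per-version statistics have already been computed and are supplied as a dictionary; the field names split_size, lead_below, lead and live, and the example numbers, are hypothetical.

```python
def find_promotable(children, stats, w, M):
    """Depth-first search for a promotable version, following Algorithm 1.

    stats[w] holds |split of T[w]|, lead_below(w), lead(w) and live(w).
    """
    s = stats[w]
    if s["split_size"] < M or s["lead_below"] < 2 * M / 3:
        return None               # fails for w, hence for all descendants
    if s["lead"] > 0 and s["live"] >= M / 3:
        return w
    for u in children.get(w, []):
        found = find_promotable(children, stats, u, M)
        if found is not None:
            return found
    return None


children = {"z": ["a", "b"]}
stats = {
    "z": {"split_size": 9, "lead_below": 5, "lead": 0, "live": 1},
    "a": {"split_size": 4, "lead_below": 1, "lead": 1, "live": 2},
    "b": {"split_size": 7, "lead_below": 4, "lead": 2, "live": 3},
}
print(find_promotable(children, stats, "z", 6))   # b
```

The orphan z passes the size and lead-below tests but has no lead keys of its own, so the search descends; child a is too small, and child b is returned.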

Lemma 1. Suppose (A'', V'') is the result of merging a promoted
array (A, V) into an existing array A' at level l. Let z be the
unique orphan of V'' ancestral to the orphan of V, and
w = find_promotable(V'', z) ≠ null. Then (1) (A_p, V_p) =
Λ_T(A'', V'', w) obeys the promotion conditions at level l + 1; and
(2) the remainder (A_r, V_r) of (A'', V'') after A_p is removed
obeys (L-no-prom).



Proof. First note that the algorithm guarantees (P-non-trivial)_w
and (P-prom)_{l+1,w}, and that (P-orphan) is obvious.

The version w cannot be a strict descendant of v, since in this case
(P-edge) for A would imply that A_p ⊆ A; (P-min-size) cannot hold at
level l + 1 for a subarray of A, for which (P-max-size) holds at
level l. Likewise w cannot be unrelated to v, since in that case
Λ_T(A'', w) = Λ_T(A', w) and so (P-prom)_{l+1,w} is in contradiction
to (L-no-prom). So w is a weak ancestor of v.

If w = v then (P-edge) at level l for A implies (P-edge) at level
l + 1, since it's a weaker constraint; if w ≠ v then the number of
keys live for w has not increased as a result of promotion, and so
(P-live)_{l+1,w} must have held prior to the promotion of A; the
necessary (P-edge) constraint is then a consequence of (L-edge)
prior to promotion.

The only condition which remains to be checked is
(P-max-size)_{l+1,A_p}, i.e. that A_p is not too large. However,
A_p ⊆ A'' and |A''| ≤ |A| + |A'| < 2^{l+2} by (L-size) and
(P-max-size) for A' and A respectively.

The second part of the lemma follows from the fact that
find_promotable finds the oldest promotable version: suppose
w' ∈ V_r has at least 2^{l+1}/3 keys live in A_r. We have w' ⊀ w,
since otherwise find_promotable would have chosen w' before reaching
w in the search ((P-lead)_{l+1,w} ⇒ (P-lead)_{l+1,w'}, and likewise
for (P-min-size)). We have w' ⋡ w because no such versions are in
V_r. Therefore, w' is incomparable to w and also to v, since w ≼ v.
But the version statistics of such versions are unaffected by the
merge, so (L-no-prom) must hold post-merge since it held for A'
before. □

3.5.2 Version Split 

Now we describe the version split process that splits the re- 
maining array (after optionally removing a promotable ar- 
ray) into a collection of arrays, each of which obeys both 
the minimum density constraint (L-dense) and a minimum 
fraction of lead elements. 

We use the notion of versions that are dense in their subtrees,
i.e., v for which δ_T(v) := δ(Λ_T(v), v) ≥ 1/3. Intuitively, the
split Λ_T(u) of a version u dense in its subtree is easy to deal
with from the point of view of density, but need not contain enough
lead elements; on the other hand, if u isn't dense in its subtree,
then its split must have a good lead ratio: in order for u to not be
dense, there must be many lead keys strictly descendant from u.

We show how to construct a split by following the 'least dense'
child down the version tree, until we find a version u not dense in
its subtree, but all of whose children are dense in their subtrees
(see Figure 1). It is not difficult to see that this always
terminates; the difficult part is showing that this process finds a
version u with children u_1, ..., u_r and a split
Λ(A'', ∪_i T[u_i]) with enough lead elements and where all versions
are dense. Removing the split subtree u_1, ..., u_i and recursing
gives a collection of splits as required, all of which obey the
required density property and all but the last of which (for which
no suitable u can be found) obey a lead ratio requirement. The
version-split algorithm is shown in Algorithm 3.




Figure 1: The version split process, starting from the orphan w.
Filled nodes are dense in their subtrees.



Algorithm 2 find_dense_kids(u_1, ..., u_n)

let u = argmin_i δ_T(u_i)
if δ_T(u) ≥ 1/3 then
    return [u_1, ..., u_n] sorted by lead_below() decreasing
else
    return find_dense_kids(children of u)
end if
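Algorithm 2 can be sketched in a few lines, assuming the subtree densities and lead-below counts are precomputed and passed in as dictionaries (these arguments, and the example values, are hypothetical). The recursion relies on leaf versions being dense in their own subtrees, so it stops before running out of children.

```python
def find_dense_kids(versions, children, density_in_subtree, lead_below):
    """Follow the least dense version down until all siblings are dense."""
    u = min(versions, key=lambda x: density_in_subtree[x])
    if density_in_subtree[u] >= 1 / 3:
        # Every sibling is dense in its subtree: order by lead_below.
        return sorted(versions, key=lambda x: -lead_below[x])
    return find_dense_kids(children.get(u, []), children,
                           density_in_subtree, lead_below)


children = {"p": ["q", "r"]}
density_in_subtree = {"p": 0.2, "q": 0.5, "r": 0.4}
lead_below = {"p": 10, "q": 3, "r": 7}

print(find_dense_kids(["p"], children, density_in_subtree, lead_below))
# ['r', 'q']
```

The orphan p is not dense in its subtree, so the search moves to its children, both of which are dense, and returns them with the larger lead-below count first.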



The proof of the fact that version_split provides a version 
split with the desired properties is deferred until the full 
paper. Here we simply state the result: 

Lemma 2 (Version Split). Suppose that (A, V) is a stratum obeying
(P-plive)_{l+1,V} and (L-no-prom). Then there is a version split of
(A, V), say (A_i, V_i) for i = 1, ..., n, such that each array
satisfies (L-dense) and (L-size) for level l, and there is at most
one index i for which lead(A_i) < |A_i|/2.

If (A, V) also satisfies (L-live) then every split of it does (since
all live elements are included), and likewise for (L-edge). It
follows that version splitting (A'', V''), which necessarily has no
promotable versions, results in a set of arrays all of which satisfy
all of the L-* conditions necessary to stay at level l.



Algorithm 3 version_split(A, V, l)
Require: (A, V) is a stratum.

let [u_1, ..., u_r] = find_dense_kids(orphans of V)
let split(j) = ∪_{i ≤ j} T[u_i]
for j = 1 to r − 1 do
    if |Λ(A, split(j))| > min(2^{l+1}, 3 · live(u_j)) then
        let U = split(j − 1)
        return version_split(A, V \ U, l) :: U
    end if
end for
return [split(r)]



The main result of this process is the following. 

Lemma 3 (Promotion). The fraction of lead elements over all output
arrays after a version split is ≥ 1/39.

Proof. First, we claim that under the same conditions as the version
split lemma, if in addition |A| < 2M and live(v) ≥ M/3 for all v,
then the number of output strata is at most 13. Consider the arrays
which obey the lead fraction constraint. Each has size at least M/3,
since at least one version is live in it, and at least half of the
array is lead, so it has at least M/6 lead keys. The total number of
lead keys in the array A is < 2M, since the array itself is no
larger than this; it follows that there can be no more than 12
arrays obeying the lead ratio constraint, and hence no more than 13
in total.

Now, a merge at level l involves at least one promoted array, which
by (P-lead) contains ≥ 2^{l+1}/3 lead elements. By the above, the
output of the merge is at most 13 arrays of size at most 2^{l+1}, so
there are at most 39 output elements per lead element. □
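The two counting steps in the proof can be written out as a worked restatement; it uses only the quantities named in the proof (M, l, the bound of 13 arrays) and introduces nothing new.

```latex
\[
\#\{\text{arrays meeting the lead ratio}\}
  \;\le\; \frac{2M}{M/6} \;=\; 12,
\qquad
\frac{\text{output elements}}{\text{lead elements}}
  \;\le\; \frac{13 \cdot 2^{l+1}}{2^{l+1}/3} \;=\; 39 .
\]
```

Adding the single array allowed to miss the lead ratio gives the 13 strata, and inverting the second ratio gives the lead fraction of at least 1/39.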

3.5.3 Extraction 

Extraction is the process of executing a version split found as in
the previous section: it takes a list of disjoint version sets
{V_i}_i and an iterator it of (k, v, x) tuples (in (k, v) order),
and outputs a set of arrays {Λ(it, V_i)}_i together with forward
pointer arrays demoted to lower levels.

For each version set V_i we create a set of output arrays A_i^j, one
for each level. A_i^0 is the primary output array; it will receive
keys in version set V_i and will end the process containing
Λ(A'', V_i). The arrays A_i^j for j > 0 will contain forward pointer
samples of this array: if we are sampling with frequency r, then
A_i^j will contain a pointer to every r-th element of A_i^{j−1}.
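The sampling of forward pointers can be sketched as follows. This is an illustration under assumptions: arrays are plain Python lists of (key, version, payload) tuples, a forward pointer is stored as the index of the sampled entry in the array below, and sampling starts at index 0; the function name is hypothetical.

```python
def build_pointer_levels(primary, r, levels):
    """Return [A^0, A^1, ..., A^levels].

    A^j holds one (key, version, index) entry for every r-th entry of
    A^(j-1), where index is that entry's position in A^(j-1).
    """
    out = [list(primary)]
    for _ in range(levels):
        below = out[-1]
        sample = [(below[i][0], below[i][1], i)
                  for i in range(0, len(below), r)]
        out.append(sample)
    return out


primary = [(k, 0, f"val{k}") for k in range(8)]   # eight keys, version 0
arrays = build_pointer_levels(primary, r=2, levels=2)
print(arrays[1])   # [(0, 0, 0), (2, 0, 2), (4, 0, 4), (6, 0, 6)]
print(arrays[2])   # [(0, 0, 0), (4, 0, 2)]
```

Each sample array is a factor r smaller than the one it indexes, so the pointer arrays demoted to lower levels add only a geometric series of extra space.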

3.6 Lookup 

A point query for (k, v) calls query_rec(A, k, v) (see Algorithm 4),
where A is the unique array registered to version v at level 0. In
general, at level l we search within a lower and upper bound to find
the least upper bound (k', v', x) for (k, v) and the associated
forward pointers strictly below and weakly above this location. If
k' = k and v' ≼ v, then the least upper bound is the desired key and
we return x (by the ordering on versions, v' must be the closest
ancestor of v for which a value of k has been written); if v' ⋠ v,
then we scan forwards until an ancestor is found, in which case we
return it, or until we reach an entry for which k' ≠ k, in which
case we recurse to the array to which forward pointers in A point.
The search terminates either when a suitable entry (k, v', x) is
found, with v' ≼ v, or when there are no forward pointers in A.

A range query query(start, end, version) is handled by performing a
lookup for (start, version) (with the modification that we do not
break out of arrays early as in the lookup described above). We then
merge the outputs of the iterators from each of these arrays in
(k, v)-order, with the exception that for any key k, we output only
the first version ancestral to the desired version, and skip over
the remaining versions.

Algorithm 4 query_rec(A, k, v, [lb], [ub])
Require: An array A, and optionally two locations within A: lower
bound lb and upper bound ub

let loc, lb, ub = A.search(k, v, lb, ub)
let it = A.iterate(loc)
let k' = k
while it.has_next() and k' = k do
    let (k', v', x) = it.next()
    if k' = k and v' ≼ v then
        return (k, v', x)
    end if
end while
let N = the next array of A
if N ≠ null then
    return query_rec(N, k, v, lb, ub)
else
    return null
end if
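The scan-forward step of the point lookup can be sketched without forward pointers. The simplifications are significant and are assumptions of this sketch: each level is a single in-memory list sorted by (key, descending DFS number of version), every level is binary-searched from scratch, the first match found from the lowest level upwards is returned, and the DFS intervals are given as a dictionary. The name query_point is hypothetical.

```python
import bisect


def query_point(levels, interval, k, v):
    """Return (version, value) for key k at the closest ancestor of v."""
    target = (k, -interval[v][0])
    for array in levels:
        # Sort keys of this array: (key, -DFS number). Rebuilt per query
        # for clarity only; a real structure would not do this.
        keys = [(key, -interval[ver][0]) for key, ver, _ in array]
        i = bisect.bisect_left(keys, target)      # least upper bound
        while i < len(array) and array[i][0] == k:
            _, ver, x = array[i]
            lo, hi = interval[ver]
            if lo <= interval[v][0] <= hi:        # ver is an ancestor of v
                return ver, x
            i += 1
    return None


# Version tree: r -> a -> c and r -> b, with DFS intervals as in the text.
interval = {"r": (0, 3), "a": (1, 2), "c": (2, 2), "b": (3, 3)}
levels = [
    [("k", "c", "new")],
    [("k", "b", "other"), ("k", "a", "mid"), ("k", "r", "old")],
]
print(query_point(levels, interval, "k", "a"))   # ('a', 'mid')
```

Because versions are ordered by descending DFS number within a key, the first ancestor met during the forward scan is the closest one.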

To get the desired lookup performance, we need to modify
the forward pointer construction. In the description here,
FPs may not be evenly spaced within an array; in particular,
for increasing inserts, all the FPs live at the start of each
array and a lookup always involves a scan to the end of some
arrays. This can be solved by storing, for some constant
8 < k < B, in every kth element of every array a redundant
FP, which is a copy of its two closest real FPs to the left
and right. This guarantees that every element has a forward
pointer within O(1) blocks on either side. Space for these
redundant FPs can be left in the initial output during the
execution phase, and their values retrospectively updated by
rescanning each output array A^0 (A^j for j > 0 consists only
of forward pointers, and so has no such problem).
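As a rough illustration of the redundant-FP idea, the sketch below computes, for every kth slot of an array, the positions of the nearest real FPs on each side. The function name `redundant_fps` and the representation of FPs as integer positions are assumptions made for this example, not details from the paper.

```python
from bisect import bisect_left, bisect_right

def redundant_fps(real_fp_positions, array_len, k):
    """For every k-th slot, find the nearest real FP at or before it and
    at or after it (None when no real FP exists on that side)."""
    fps = sorted(real_fp_positions)
    out = []
    for slot in range(0, array_len, k):
        i = bisect_right(fps, slot)          # number of real FPs <= slot
        left = fps[i - 1] if i > 0 else None
        j = bisect_left(fps, slot)           # first real FP >= slot
        right = fps[j] if j < len(fps) else None
        out.append((slot, left, right))
    return out
```

With all real FPs clustered at the start of the array (the increasing-inserts case), every later slot still records the last real FP on its left, so a lookup landing there does not need to scan back to the start of the array.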

3.7 Clone 

On snapshot or clone of version v to a new descendant version
v', v' is registered for each array A which is currently
registered to v, the parent of v'. This does not require any I/Os.
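A minimal sketch of clone as pure in-memory bookkeeping, assuming the registration is kept as a map from each version to the set of array identifiers registered to it. The names `clone`, `parent` and `registered` are hypothetical, as is the choice to number versions in creation order.

```python
def clone(parent, registered, v):
    """Create a child of v that inherits v's array registrations.
    Only in-memory maps are touched, so no I/O is performed."""
    new_version = len(parent)                     # assumed: versions numbered in creation order
    parent[new_version] = v
    registered[new_version] = set(registered[v])  # copy, so later changes stay per-version
    return new_version
```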

3.8 Update 

Theorem 1. The stratified doubling array performs updates
to a leaf version v in a cache-oblivious O(log N_v / B)
amortized I/Os.

Proof. Assume we have at our disposal a memory buffer
of size at least B (recall that B is not known to the
algorithm). Then each array that is involved in a disk merge
has size at least B, so a merge of some number of arrays of
total size k elements costs O(k/B) I/Os. In the COLA [5],
each element exists in exactly one array and may participate
in O(log N) merges, which immediately gives the desired
amortized bound. In the scheme described here, elements
may exist in many arrays, and elements may participate in
many merges at the same level (e.g. when an array at level
l is version split and some subarrays remain at level l after
the version split). Nevertheless, we shall prove the theorem
using a more involved accounting argument.



We will assume that each I/O costs $1 and can read or
write B elements. Each element (k, v) inserted at version
v has an initial credit $c/B, for some constant c to be
determined later. For an array (A, W), recall that an element (k, w)
is a lead element if w ∈ W. When an array (A, W) is
promoted from level l to l + 1, all its lead elements are given
extra credit $c/B. Assume for now that this is sufficient
to pay for all I/O operations. By the level condition
(L-live), all arrays (A, W) with v ∈ W must live in levels
l ≤ lg N_v + O(1). This implies that the total charge to
element (k, v) is O(log N_v / B), since (k, v) appears as a lead
element in exactly one array, is only charged when it
appears as a lead element (hence it can only be charged at
those levels where v ∈ W), and lead elements can never be
demoted (so it is charged at most once per promotion). It
now remains to prove the assumption.
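The accounting above can be written out as a single bound. This is only a restatement of the argument, under the assumption that an element receives its initial credit once and one extra credit per promotion, with at most lg N_v + O(1) promotions:

```latex
\[
  \mathrm{charge}(k,v)
  \;\le\; \underbrace{\frac{c}{B}}_{\text{initial credit}}
  \;+\; \underbrace{\bigl(\lg N_v + O(1)\bigr)\cdot\frac{c}{B}}_{\text{one credit per promotion}}
  \;=\; O\!\left(\frac{\log N_v}{B}\right).
\]
```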

Lemma 4. For c ≥ 45, the credit of every element (k, v)
is ≥ 0.

Proof. Each array is either dead or alive. An array at
level l is alive if it enters level l by being promoted from
level l − 1, and dead if it enters level l
as a result of a version split. Consider a merge at level l.
The algorithm guarantees that at least one of the arrays involved
is alive. We will charge the entire cost of the merge to the
lead elements in the alive arrays participating in it (if there
is more than one such array, divide the cost equally between
their lead elements).

Consider a merge at level l. It involves O(1) passes over
O(1) input arrays, at least one of which is alive (otherwise
the promotion would not have been triggered), followed by a
version split which produces some output arrays. The
Promotion Lemma implies that for c ≥ 39, the lead elements in
the alive array can pay for all the I/Os involved in producing
the output arrays. Since the input array has just been
promoted, by (P-lead), it has at least 2^{l+1}/3 lead elements. The
total input size is at most 2^{l+1}, so to perform the input and
output passes, it suffices to ensure that c ≥ 39 + 6 = 45. □

The theorem follows since the lead elements of alive ar- 
rays can pay for all the I/Os involved in merging and split- 
ting. □ 

3.9 Lookup 

For large range queries (that retrieve a constant fraction
Z = Ω(N_v) of the live keys of some version v), the density
property of the arrays immediately gives an asymptotically
optimal bound of O(log N_v + Z/B). For much smaller range
queries, the worst-case performance may be the same as for
a point query. We now prove the amortized bound, which
applies to smaller queries.

Theorem 2. A range query at version v costs O(log N_v +
Z/B) amortized I/Os.

Proof. We first consider just point queries, and amortize
the cost of lookup(k, v) over all keys live at v. Let l(k, v) be
the cost of lookup(k, v); then the amortized cost is given by

\[ \frac{1}{N_v} \sum_k l(k,v). \]

For an array A_i, let l(k, v, A_i) be the number of I/Os used
in examining elements in A_i for lookup(k, v). The idea is
that since the elements of A_i are (k, v)-ordered, the parts of
A_i examined by each key lookup for version v are disjoint,
hence l(k, v, A_i) ≤ |A_i|, and the lookups for all keys together
examine at most |A_i| elements of A_i. We have the following:

\[
  \frac{\sum_k l(k,v)}{N_v}
  \;\le\; \frac{\sum_i |A_i| + \sum_k \sum_i O(1)}{N_v}
  \;=\; \frac{O(N_v) + N_v \cdot O(\log N_v)}{N_v}
  \;=\; O(\log N_v).
\]

The O(1) additive term is due to finding and following FPs
between arrays, which follows from the redundant FP
construction. The bound on the second term follows since there are
O(log N_v) arrays examined during the searches for all keys in
some fixed version v, and the bound on the first term follows from the
density property and the geometrically increasing array sizes.

A range query incurs the same initial lookup cost (to locate
the starting points of the iterators in each array), and then
the subsequent scan can be analysed in the same way as
above. We can also deamortize the point query bound to
worst-case O(log N_v) by embedding a small lookup structure
into the part of each array defined by each key k. The proofs
will appear in the full version of the paper. □

3.10 Space 

We can prove that the structure has asymptotically optimal 
space requirements. 

Theorem 3. The structure uses O(N) external memory
space to store N elements, regardless of the version tree.

Proof. It is easy to see that there are exactly N lead
elements. Whenever an array containing k lead elements is
promoted, Theorem 1 established that at most O(k) space
is used and possibly never freed (used by dead arrays). Each
lead element gets promoted exactly once and by the
promotion conditions, the number of lead elements must double
between successive promotions. Thus the total space is at
most O(Σ_{i≥0} N/2^i) = O(N). □
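For completeness, the final sum in the space bound is a geometric series, which evaluates as follows:

```latex
\[
  \sum_{i \ge 0} \frac{N}{2^i}
  \;=\; N \cdot \frac{1}{1 - \tfrac12}
  \;=\; 2N
  \;=\; O(N).
\]
```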

3.11 Deamortized updates 

A single update in the SDA algorithm may trigger a cascade
of merges, in the worst case requiring Ω(N/B) I/Os. With
some effort, we can deamortize the merge and insert
processes, such that lookups are valid at every point in time,
and an insert to version v takes worst-case O(log N_v) I/Os,
and amortized O(log N_v / B) I/Os as before.

3.12 Cache-aware tradeoffs 

Examining the level conditions in Section 3.3.1, arrays
approximately double in size between successive levels.
Similarly to Bender et al. [5], we can obtain a range of
'cache-aware' query/update tradeoffs by selecting a 'growth factor'
g = > 2. Some work is needed to ensure the version 
splitting process, and we can obtain the following results, 
the proof of which is deferred to the full paper. 



Theorem 4. The data structure can support updates to 
version v in amortized 0{ ) I/Os, and range queries 

of size Z in amortized Q(^l2EMlh. + ^) I/Os. 

4. EXPERIMENTAL RESULTS 

We implemented a prototype of the versioned B-tree and the
cache-oblivious versioned B-tree in OCaml, an efficient
compiled functional language. The machine had 1GB RAM (of which
we made only 256MB available to the buffer cache in
the tests), a 2GHz AMD Athlon 64 processor (although our
implementation was only single-threaded), a 500GB SATA
disk and an Intel X25-M SSD. We used a block size of 32KB
for the B-tree on disk and 4KB on SSD. The disk can
perform about 50 such I/Os/s; by contrast, the SSD can perform
35,000 random 4KB reads/s but must write in blocks of
512KB; however, some buffering tricks in the firmware allow
writes of 4KB blocks. There was no such tuning to be
done for the cache-oblivious structure.

We started with a single root version and inserted random
100-byte key-value pairs to random leaf versions, and
periodically performed range queries of size 10,000 at a random
version. Every 100,000 insertions, we create a new version as
follows: with probability 1/3 we clone a random leaf version
and with probability 2/3 we clone a random internal node of the version
tree. This aims to keep the version tree 'balanced' in the
sense that there are roughly twice as many internal nodes
as leaves.
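The insert and clone schedule described above could be generated along the following lines. This is a sketch under stated assumptions rather than the benchmark code: the 16/84 split of the 100 bytes into key and value, the fallback of cloning a leaf while no internal node exists, and the name `workload` are all hypothetical, and the periodic range queries are omitted because their period is not given.

```python
import random

def workload(n_inserts, clone_every=100_000, key_len=16, value_len=84, seed=0):
    """Yield ("insert", leaf_version, key, value) and ("clone", src, new) operations."""
    rng = random.Random(seed)
    leaves, internal = [0], []   # version 0 is the single root version
    next_version = 1
    for i in range(1, n_inserts + 1):
        key = bytes(rng.getrandbits(8) for _ in range(key_len))
        value = bytes(rng.getrandbits(8) for _ in range(value_len))
        yield ("insert", rng.choice(leaves), key, value)
        if i % clone_every == 0:
            # With probability 2/3 clone an internal node, otherwise a leaf
            # (assumed fallback: clone a leaf when no internal node exists yet).
            if internal and rng.random() >= 1 / 3:
                src = rng.choice(internal)
            else:
                src = rng.choice(leaves)
                leaves.remove(src)       # a cloned leaf becomes internal
                internal.append(src)
            leaves.append(next_version)  # the new version starts as a leaf
            yield ("clone", src, next_version)
            next_version += 1
```

Writing the schedule as a generator keeps it independent of the structure under test, so the same operation stream can be replayed against each implementation being compared.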

The figures show results for the versioned B-tree (btree),
the SDA (strat-DA) and, for comparison, the SDA where
we forbid any version splitting; hence there is a single
array at each level. Figure 5 shows the insertion performance
on the disk. As expected, the B-tree performance degrades
rapidly when the dataset exceeds internal memory available.
The remaining figures show range query performance on disk and
the SSD. The SDA beats the B-tree and DA by a factor of
more than 10 on both disk and SSD, while the versioned
B-tree beats the non-version-split DA on SSD, likely due to
excellent random read performance (on disk, the overhead
of scanning over irrelevant versions in the DA appears to be
low compared to the overhead of performing random reads).
APPENDIX

A. PROOFS 

A.l Version Split Lemma 

We prove the Version Split Lemma, which is crucial for main- 
taining density within levels, and for bounding the amount 
of merge work we need to do. We first demonstrate that 
there is some redundancy in the P-* conditions: 

Lemma 5. If (A, V) is a stratum for which (P-plive)_{l+1,V}
and (L-no-promote) hold, then (P-live)_{l+1,V} (P-max-…

\[ \mathrm{live}(v) \ge 2^{l+1}/3 \;\Longrightarrow\; |\lambda(A, T_V v)| \le 2^{l+1}. \tag{3} \]

In particular, all such versions are dense in their subtrees.

Proof. Suppose by way of contradiction that there exists
a version w such that live(w) ≥ 2^{l+1}/3 and |λ(A, T_V w)| >
M, where M = 2^{l+1}. Let v be the oldest ancestor of w in V for which live(v) ≥
2^{l+1}/3. Since v is an ancestor of w, |λ(A, T_V v)| ≥ |λ(A, T_V w)|
and so (P-live) and (P-min-size) both hold for v. By
(L-no-promote), lead_below(v) < 2^{l+1}/3. Whether v is an orphan
of V (in which case (P-plive)_{l+1,V} is needed) or not, it must
be the case that live(parent(v)) < 2^{l+1}/3, so |λ(A, T_V v)| ≤
live(parent(v)) + lead_below(v) < 2^{l+1}, in contradiction to
(P-min-size). □



The following lemma forms the basis of the Version Split 
Lemma. 



Lemma 6. Suppose (A, W) is a stratum in level l such
that, for M = 2^{l+1}, we have

1. (cD) live(v) < M/3 for v = parent(W) and all versions
v ∈ W that are not dense in their subtrees; and

2. (cS) |λ(A, T_V v)| ≤ M for all v ∈ W such that v is
dense in its subtree.

Then version_split(W, l) gives a version split W_i of (A, W)
such that the associated extracted arrays A_i = λ(A, W_i)
satisfy:

1. (vS) |A_i| ≤ M for all i;

2. (vD) A_i is dense for all versions in W_i;

3. (vL) > i for all but at most one i.

Proof. Proof is by induction on the size of W. Consider
a pass through the version_split algorithm.

The subroutine find_dense_kids returns the children u_1 ... u_r
of some version u, such that all u_i are dense in their
subtrees; the children are ordered decreasing by lead_below. It
returns the orphans of W iff all orphans of W are dense in
their own subtree; in this case u = parent(W) and it is not
known whether u is dense in A or not; in all other cases u is
not dense in its subtree.

As a particular case of the density of the u_i, u_1 is dense in
its subtree, which from (cS) means that |λ(A, T_V u_1)| ≤ M;
density implies that |λ(A, T_V u_1)| ≤ 3 live(u_1). Therefore
the size test on line 4 of the algorithm must evaluate to
false, and we never split at u_1.

If the test evaluates to true for i > 1 then, since it was
false at i − 1, the array U constructed in line 7 must satisfy
|λ(A, U)| ≤ M (vS), and is dense for all versions (vD). This
also holds if the loop exits without having found an upper
bound i: U = split(r) is still small enough and dense for
all versions. In this case we are done since the version split
list has a single entry, so (vL) holds. This is the base case
for induction, since it exhausts W.

In the former case, where split(i) fails the size test, the
lemma will be proved by induction so long as we can establish
firstly that the conditions of the lemma still hold for W \ U,
and secondly that (vL) holds for U, for which it suffices to
prove that live(u) < lead(U), since |λ(A, U)| ≤ live(u) +
lead(U).

The condition (cD) is trivially maintained, since the set of
versions not dense in their subtrees can only shrink. For
(cS), we have to check that if v is a version in W \ U, not
dense in λ(A, T_W v), but made dense in λ(A, T_{W \ U} v) by
the removal of keys with versions strictly descendant from
v, then |λ(A, T_{W \ U} v)| ≤ M (to maintain (cS)). However,
since such a version is not dense prior to the removal of
A_i, it follows from (cD) that live(v) < M/3, and so
post-hoc density implies |λ(A, T_{W \ U} v)| ≤ 3 · live(v) < M as
required for (cS).

We now return to the case where there is an i > 1 such that
U = split(i − 1) passes the size test, but U' = split(i)
fails. Failure at i implies either that there is a j ≤ i such
that u_j is not dense in U', or that |U'| > M. In either case,
since the u_k are ordered by lead_below() decreasing,

\[
  \mathrm{lead}(U') = \sum_{k=1}^{i} \mathrm{lead\_below}(u_k)
  \;\le\; \frac{i}{i-1} \sum_{k=1}^{i-1} \mathrm{lead\_below}(u_k)
  \;\le\; 2\,\mathrm{lead}(U),
  \qquad\text{i.e.}\qquad
  \mathrm{lead}(U) \ge \tfrac12\,\mathrm{lead}(U').
  \tag{4}
\]

If |U'| > M then we can use (cD) directly: live(u) < M/3,
and so lead(U') ≥ |U'| − live(u) > 2M/3, which implies
lead(U) > M/3 from (4), and so (vL) holds for U. If on the
other hand u_j is not dense in U' then

\[
\begin{aligned}
  3\,\mathrm{live}(u) &\le 3\,\mathrm{live}(u_j) < |U'| \le \mathrm{live}(u) + \mathrm{lead}(U'), \\
  2\,\mathrm{live}(u) &< \mathrm{lead}(U') \le 2\,\mathrm{lead}(U).
\end{aligned}
\]

In either case live(u) < lead(U) and therefore (vL) holds. □



Now we apply this lemma to prove the version split lemma. 



Proof of Version Split Lemma. To prove the lemma
we must establish that (cD) and (cS) are a consequence of
(P-plive)_{l+1,V} and (L-no-promote). (P-plive)_{l+1,V} is exactly
the first clause of (cD), namely live(parent(V)) < M/3.
Consider any version v. From Lemma 5, v is dense in its
subtree whenever live(v) ≥ M/3, the contrapositive of which is
that whenever v is not dense in its subtree, live(v) < M/3,
i.e. (cD). On the other hand, if v is dense in its subtree
and |λ(A, T_V v)| > M then by the definition of density,
live(v) > M/3, a contradiction to Lemma 5. Thus, we must
have |λ(A, T_V v)| ≤ M, proving (cS). □



